Random forest versus logistic regression: a large-scale benchmark experiment

نویسندگان

  • Raphael Couronné
  • Philipp Probst
  • Anne-Laure Boulesteix
  • Raphael Couronne
چکیده

The Random Forest (RF) algorithm for regression and classification has considerably gained popularity since its introduction in 2001. Meanwhile, it has grown to a standard classification approach competing with logistic regression in many innovation-friendly scientific fields. In this context, we present a large scale benchmarking experiment based on 260 real datasets comparing the prediction performance of the original version of RF with default parameters and LR as binary classification tools. Most importantly, the design of our benchmark experiment is inspired from clinical trial methodology, thus avoiding common pitfalls and major sources of biases. RF performed better than LR according to the considered accuracy measured in approximately 69% of the datasets. The mean difference between RF and LR was 0.032 (95%-CI=[0.025, 0.042]) for the accuracy, 0.043 (95%-CI=[0.032, 0.056]) for the Area Under the Curve, and −0.028 (95%-CI=[−0.036,−0.022]) for the Brier score, all measures thus suggesting a significantly better performance of RF. As a side-result of our benchmarking experiment, we observed that the results were highly dependent on the inclusion criteria used to select the example datasets, thus emphasizing the importance of clear statements regarding this dataset selection process. We also stress that neutral studies similar to ours, based on a high number of datasets and carefully designed, will be necessary in the future to evaluate further variants, implementations or parameters of random forests which may yield improved accuracy compared to the original version with default values. ∗[email protected][email protected][email protected]

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Comparison of Random Forest and Logistic Regression Methods in Predicting Mortality in Colorectal Cancer Patients and its Related Factors

Background and Objectives: The purpose of this study was to predict the mortality rate of colorectal cancer in Iranian patients and determine the effective factors  on the mortality of patients with colorectal cancer using random forest and logistic regression methods.   Methods: Data from 304 patients with colorectal cancer registry from the Gastroenterology and Liver Research Center of Shah...

متن کامل

Extreme Logistic Regression: A Large Scale Learning Algorithm with Application to Prostate Cancer Mortality Prediction

With the recent popularity of electronic medical records, enormous amount of medical data is being generated every day at an exponential rate. Machine learning methods have been shown in many studies to be capable of producing automatic medical diagnostic models such as automated prognostic models. However, many powerful machine learning algorithms such as support vector machine (SVM), Random F...

متن کامل

Joint Maximum Purity Forest with Application to Image Super-Resolution

In this paper, we propose a novel random-forest scheme, namely Joint Maximum Purity Forest (JMPF), for classification, clustering, and regression tasks. In the JMPF scheme, the original feature space is transformed into a compactly pre-clustered feature space, via a trained rotation matrix. The rotation matrix is obtained through an iterative quantization process, where the input data belonging...

متن کامل

Susceptibility Zoning of Dust Source Areas by Data Mining Methods over Khorasan Razavi Province

Extended abstract Introduction Dust storms are natural hazards that effect on weather conditions, human health and ecosystem. Atmospheric processes are directly affected by the absorption and diffusion of radiation by dust, and dust in the cloud acts as a nucleus of congestion. The main dust areas in the world are flat topographically dry areas with erosion-sensitive soil and poor vege...

متن کامل

A Robust Missing Value Imputation Method MifImpute For Incomplete Molecular Descriptor Data And Comparative Analysis With Other Missing Value Imputation Methods

Missing data imputation is an important research topic in data mining. Large-scale Molecular descriptor data may contains missing values (MVs). However, some methods for downstream analyses, including some prediction tools, require a complete descriptor data matrix. We propose and evaluate an iterative imputation method MiFoImpute based on a random forest. By averaging over many unpruned regres...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2017